National Combustion Code (NCC)

Code Description 

- Integrated system of codes for the design & analysis of combustion 
systems 

- Advanced features to meet designers’ requirements for model
accuracy and turnaround time

- Industry/government development team 

- Primary flow solver is CORSAIR-CCD

Fundamental Features at Inception

- Unstructured mesh 

- Parallel processing 
NCC Performance Improvement Effort


• Achieve a 15-hour turnaround time with NCC on a large-scale,
fully reacting combustor simulation by September 1998.

• The current goal is to achieve a 3-hour turnaround of a full
combustor simulation (1.3 million elements) using CORSAIR-CCD
by September 2001. This will represent a 1000:1
reduction in turnaround time relative to 1992.
Benchmark Test Cases

• Lean direct-injection / multiple Venturi swirler (LDI-MVS) 
combustor 

- ~444,000 computational elements

- Finite-rate chemistry (12 species, 10 steps) 

- All turbulence, species and enthalpy equations turned on

- Estimated to converge in 10,000 iterations

• The benchmark geometry to satisfy the NPSS milestones 
should be in the range of 1.3 million elements. 

• A second LDI-MVS test case is also available with ~971,000
elements.
Benchmark Hardware Platforms


Hardware Platform 

• IBM SP-2 

- 144 RS6000/590S 


• SGI Origin 2000 

- 64- and 256-processor systems with 250 MHz R10000 processors
Baseline Performance

• Test case

- LDI-MVS combustor (444K elements) 

- Finite-rate chemistry (12 species, 10 steps) 

- Platform: IBM SP-2 

• Performance 

- 64 processors 

- 61.4 secs/iteration 

• Estimated convergence in 10,000 iterations or 171 hours.

• Estimated convergence for a 1.3 million element combustor is
512 hours.
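The convergence estimates on this and the following slides come from simple arithmetic. The Python sketch below assumes that wall-clock time is the cost per iteration times the 10,000-iteration estimate, and that time per iteration grows linearly with element count at a fixed processor count; the function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600.0


def hours_to_converge(secs_per_iter, iterations=10_000):
    """Wall-clock hours for `iterations` steps at a fixed cost per step."""
    return secs_per_iter * iterations / SECONDS_PER_HOUR


def scale_to_elements(hours, from_elements, to_elements):
    """Extrapolate run time to another mesh size, assuming cost is linear in element count."""
    return hours * to_elements / from_elements


# Baseline: 61.4 secs/iteration on the 444K-element LDI-MVS case.
baseline_hours = hours_to_converge(61.4)          # about 170.6 hours
baseline_13m = scale_to_elements(baseline_hours, 444_000, 1_300_000)

# 971K-element case later in the deck: 1.37 secs/iteration.
large_hours = hours_to_converge(1.37)             # about 3.8 hours
large_13m = scale_to_elements(large_hours, 971_000, 1_300_000)

print(f"444K case: {baseline_hours:.0f} h, extrapolated to 1.3M: {baseline_13m:.0f} h")
print(f"971K case: {large_hours:.1f} h, extrapolated to 1.3M: {large_13m:.1f} h")
```

With the strict element ratio (1.3M / 444K ≈ 2.93) the baseline extrapolates to about 499 hours; the slide's 512 hours corresponds to a round factor of 3 (3 × 170.6 ≈ 512). The 971K-element estimate of 5.1 hours matches the linear element ratio directly.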
Significant Performance Improvements


• Algorithm modifications 

• Code streamlining 

• Deadlock elimination 

• Hardware upgrades 

• ILDM kinetics module

• SGI FORTRAN I/O library 

• Domain decomposition strategy 
Algorithm Modifications

• CORSAIR-CCD uses a four-stage Runge-Kutta algorithm. 

- The convective, viscous and artificial dissipation terms were 
originally computed at each stage. 

• The algorithm was modified: 

- The convective terms continue to be computed at each stage.

- The viscous and artificial dissipation terms are computed at the first
stage and held constant for the remaining stages.

• This modification eliminated substantial computation and cut the 
required message passing in half. 
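The modification can be illustrated on a model problem. The Python sketch below is not CORSAIR-CCD code: it applies a four-stage scheme of the form u(k) = u(0) - α_k Δt [C(u(k-1)) - D] to 1-D periodic advection-diffusion, where a central-difference term C stands in for the convective terms and a second-difference term D stands in for the viscous and artificial dissipation terms. The stage coefficients (1/4, 1/3, 1/2, 1) are an assumption; the slides do not give CORSAIR-CCD's coefficients. Setting `freeze_dissipation=True` evaluates D only at the first stage, mirroring the change described above.

```python
import math

STAGE_COEFFS = (1 / 4, 1 / 3, 1 / 2, 1.0)  # assumed multistage coefficients


def convective(u, c, dx):
    """Central difference of c * du/dx on a periodic grid."""
    n = len(u)
    return [c * (u[(i + 1) % n] - u[(i - 1) % n]) / (2 * dx) for i in range(n)]


def dissipative(u, eps, dx):
    """Second difference eps * d2u/dx2, a stand-in for viscous/artificial dissipation."""
    n = len(u)
    return [eps * (u[(i + 1) % n] - 2 * u[i] + u[(i - 1) % n]) / dx**2 for i in range(n)]


def multistage_step(u0, dt, c, eps, dx, freeze_dissipation, counts):
    """One time step of du/dt = -C(u) + D(u) using four stages."""
    u, d = u0, None
    for alpha in STAGE_COEFFS:
        conv = convective(u, c, dx)            # recomputed at every stage
        counts["convective"] += 1
        if d is None or not freeze_dissipation:
            d = dissipative(u, eps, dx)        # held fixed after stage 1 when frozen
            counts["dissipative"] += 1
        u = [u0[i] - alpha * dt * (conv[i] - d[i]) for i in range(len(u0))]
    return u


def run(freeze_dissipation, n=50, steps=100, c=1.0, eps=0.01, dt=0.005):
    """Advance a sine wave `steps` time steps and count operator evaluations."""
    dx = 1.0 / n
    u = [math.sin(2 * math.pi * i * dx) for i in range(n)]
    counts = {"convective": 0, "dissipative": 0}
    for _ in range(steps):
        u = multistage_step(u, dt, c, eps, dx, freeze_dissipation, counts)
    return u, counts


u_full, counts_full = run(freeze_dissipation=False)
u_frozen, counts_frozen = run(freeze_dissipation=True)
max_diff = max(abs(a - b) for a, b in zip(u_full, u_frozen))
print(counts_full, counts_frozen, f"max difference {max_diff:.2e}")
```

The frozen variant evaluates the dissipative operator once per step instead of four times, while the two solutions stay close for this mild, well-resolved case. In a distributed-memory code each evaluation of such terms typically needs boundary data from neighboring partitions, so freezing them reduces communication as well as arithmetic, consistent with the halving of message passing reported above.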
Performance Following
Algorithm Modifications 


• Test case

- LDI-MVS combustor (444K elements) 

- Finite-rate chemistry (12 species, 10 steps) 

- Platform: IBM SP-2 

• Performance 

- 84 processors 

- 28.5 secs/iteration 

• Estimated convergence in 10,000 iterations or 79 hours. 

• Estimated convergence for a 1.3 million element combustor is
238 hours.


2000 NPSS Review 


Performance Profiling Results:
Code Streamlining


• chdiff (calculates the viscosity and thermal conductivity of the gas mixture): 40.1%

• chprop (solves for the gas-phase temperature and updates the gas-phase specific heat): 13.8%

• derivatives (calculates the first-order derivatives): 4.7%

• chmsol (solves the linear systems of equations): 4.4%

• residual_smoothing: 4.1%

• chmscc (calculates the coefficient matrix and B vector): 2.0%

54% of the time is spent in the two chemistry routines (chdiff and chprop).
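The profile suggests an upper bound on what streamlining these two routines alone could achieve. A minimal Python sketch using Amdahl's law, assuming the profile is representative of the runs before and after streamlining (28.5 and 14.8 secs/iteration); the slides do not say which run was profiled:

```python
# Percent of run time per routine, from the profile above.
profile = {
    "chdiff": 40.1,
    "chprop": 13.8,
    "derivatives": 4.7,
    "chmsol": 4.4,
    "residual_smoothing": 4.1,
    "chmscc": 2.0,
}


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


chem_fraction = (profile["chdiff"] + profile["chprop"]) / 100.0   # about 0.539
limit = amdahl_speedup(chem_fraction, float("inf"))              # chemistry cost -> 0
observed = 28.5 / 14.8   # secs/iteration before and after code streamlining

print(f"upper bound {limit:.2f}x, observed {observed:.2f}x")
```

Even removing the two routines' cost entirely would cap the overall gain at about 2.17x, so if the improvement came mainly from these routines, the observed 1.93x captured most of what was available there.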
Code Streamlining (continued)


• Streamlined finite-rate chemistry operations 

- Replaced “a**0.25” with “sqrt(sqrt(a))”. 

- Eliminated unnecessary indexing of temporary variables. 

- Relocated some operations to an initialization routine. 

- Several division operations were replaced by multiplication by
the multiplicative inverse (reciprocal) of the divisor.
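The first and last transformations can be sketched side by side. The kernel below is hypothetical (the slides do not show the chemistry code), and Python is used only to show that the rewritten expressions give the same results to within rounding; the speed benefit reported for the Fortran code does not necessarily carry over to Python.

```python
import math


def rate_before(temps, conc, w):
    """Original form: a fractional power and a division inside the loop."""
    return [t ** 0.25 * c / w for t, c in zip(temps, conc)]


def rate_after(temps, conc, inv_w):
    """Streamlined form: two square roots and a multiply by a precomputed reciprocal."""
    return [math.sqrt(math.sqrt(t)) * c * inv_w for t, c in zip(temps, conc)]


temps = [300.0, 1200.0, 2100.0]   # hypothetical inputs
conc = [0.1, 0.2, 0.3]
w = 28.0
inv_w = 1.0 / w                   # computed once, e.g. in an initialization routine

before = rate_before(temps, conc, w)
after = rate_after(temps, conc, inv_w)
assert all(math.isclose(b, a, rel_tol=1e-12) for b, a in zip(before, after))
```

The two forms can differ in the last bits, because sqrt(sqrt(a)) and multiplication by a reciprocal round differently from a**0.25 and division, so the check uses a relative tolerance rather than exact equality.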
Performance Following
Code Streamlining 


• Test case 

- LDI-MVS combustor (444K elements) 

- Finite-rate chemistry (12 species, 10 steps) 

- Platform: IBM SP-2 

• Performance 

- 84 processors 

- 14.8 secs/iteration 

• Estimated convergence in 10,000 iterations or 41 
hours. 

• Estimated convergence for a 1.3 million element
combustor is 123 hours.
Deadlock Elimination


• The existing communication scheme was sufficient with a simple 



Performance Following 
Deadlock Elimination 

• Test case 

- LDI-MVS combustor (444K elements) 

- Finite-rate chemistry (12 species, 10 steps) 

- Platform: IBM SP-2 

• Performance 


- 96 processors 

- 13.0 secs/iteration 


• Estimated convergence in 10,000 iterations or 36
hours.

• Estimated convergence for a 1.3 million element
combustor is 108 hours.
Hardware Upgrade


• IBM SP-2 

- 96 processors 

- 13.0 secs/iteration 

- Speedup = ~80.4

- Efficiency = ~84%


• SGI Origin 2000

- 32 processors 

- 10.1 secs/iteration 

- Speedup = 26.3 

- Efficiency = 82% 


• A 1.3x improvement in performance was realized
by switching from 96 IBM SP-2 processors to 32 SGI Origin processors.

• Estimated convergence for a 1.3 million element
combustor is 84 hours.
Hardware Upgrade


• IBM SP-2 

- 32 processors 

- 34.4 secs/iteration 

- Speedup = ~30.4

- Efficiency = ~95%


• SGI Origin 2000 

- 32 processors 

- 10.1 secs/iteration 

- Speedup = 26.3 

- Efficiency = 82% 


• A 3.4x improvement in performance was realized
when comparing 32-processor results on the IBM SP-2
and the SGI Origin.
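The speedup and efficiency figures on these two slides are related by E = S / p for p processors, and each reported speedup implies a single-processor time T1 = S × Tp. A short Python check of the slide numbers (the run labels are only for readability):

```python
# (label, processors, secs/iteration, reported speedup), from the slides above
runs = [
    ("IBM SP-2", 96, 13.0, 80.4),
    ("IBM SP-2", 32, 34.4, 30.4),
    ("SGI Origin 2000", 32, 10.1, 26.3),
]

efficiency = {}
for label, procs, secs, speedup in runs:
    efficiency[(label, procs)] = speedup / procs      # E = S / p
    serial_secs = secs * speedup                      # implied T1, since S = T1 / Tp
    print(f"{label:16s} p={procs:3d}  E={speedup / procs:.0%}  T1~{serial_secs:.0f} s/iter")

# Ratios of secs/iteration behind the 1.3x and 3.4x improvement figures
gain_vs_96_sp2 = 13.0 / 10.1
gain_vs_32_sp2 = 34.4 / 10.1
```

The two SP-2 runs imply nearly the same single-processor time (about 1045 secs/iteration), which is a useful consistency check on the reported speedups.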
ILDM Kinetics Module


• Intrinsic low-dimensional manifold (ILDM) 

• Replaced the existing finite-rate chemistry module 

- Solves two scalar equations rather than 12 species equations.

- Species are obtained from the ILDM tables.

- Properties such as density, viscosity and temperature can also be obtained
from the ILDM tables.

- Computation and message passing cost are reduced considerably. 
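Replacing species transport with table lookup can be sketched as follows. The slides do not describe the NCC table format, so everything here is an assumption for illustration: a property is tabulated on a rectangular grid of the two transported scalars (called x and y here) and recovered by bilinear interpolation. Species mass fractions, density, viscosity or temperature would each be another table indexed the same way.

```python
from bisect import bisect_right


def bilinear_lookup(xs, ys, table, x, y):
    """Bilinearly interpolate table[i][j], tabulated at (xs[i], ys[j]), at (x, y).

    xs and ys must be ascending; queries outside the table are clamped to its edges.
    """
    x = min(max(x, xs[0]), xs[-1])
    y = min(max(y, ys[0]), ys[-1])
    i = min(bisect_right(xs, x) - 1, len(xs) - 2)   # lower cell index along x
    j = min(bisect_right(ys, y) - 1, len(ys) - 2)   # lower cell index along y
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    return ((1 - tx) * (1 - ty) * table[i][j]
            + tx * (1 - ty) * table[i + 1][j]
            + (1 - tx) * ty * table[i][j + 1]
            + tx * ty * table[i + 1][j + 1])


# Hypothetical 3 x 3 temperature table (K) over the two reduced variables.
xs = [0.0, 0.5, 1.0]
ys = [0.0, 0.5, 1.0]
temperature = [[300.0 + 1700.0 * x * y for y in ys] for x in xs]

# Per cell, the flow solver would carry only (x, y) and look properties up.
t_cell = bilinear_lookup(xs, ys, temperature, 0.25, 0.75)
```

Bilinear interpolation reproduces this particular table exactly because the tabulated function is itself bilinear; real manifold tables generally are not, so table resolution matters.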





Performance with the 
ILDM Kinetics Module 

• Test case 

- LDI-MVS combustor (444K elements) 

- ILDM Kinetics Module 

- Platform: SGI Origin 2000 

• Performance 

- 32 processors 

- 2.1 secs/iteration 

• Estimated convergence in 10,000 iterations or 6 
hours. 

• Estimated convergence for a 1.3 million element
combustor is 18 hours.
SGI FORTRAN I/O Library


• Scaling improved by switching to the SGI f90 compiler.

- Performance did not change when using 32 or fewer processors.

- Performance improved when using more than 32 processors.

- Initialization time decreased dramatically.

• The SGI f90 I/O library handled multiple processes accessing 
the same file much more efficiently than the SGI f77 I/O library. 

- Each process was printing a residual to the standard output.



Domain Decomposition Strategy 


• The METIS* grid partitioning tool (University of Minnesota) was used to
provide an alternative domain decomposition strategy for NCC.

- The interface between processes is minimized.

- Each process communicates with more neighboring processes, but
each message is much smaller.

• Code scalability is greatly improved on the Origin 2000, allowing
more processors to be used efficiently.
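The trade-off described above can be seen on a toy mesh. The Python sketch below does not call METIS; it compares two hand-made 16-way partitions of a 16 x 16 grid of cells connected through shared faces: horizontal strips, and compact 4 x 4 blocks, roughly the kind of partition an interface-minimizing partitioner aims for. For each partition it reports the total number of cut faces, the largest number of neighboring partitions, and the largest number of faces shared by any two partitions (a proxy for message size).

```python
from collections import defaultdict


def grid_faces(n):
    """Yield pairs of face-adjacent cells in an n x n grid; cell id = row * n + col."""
    for r in range(n):
        for c in range(n):
            cell = r * n + c
            if c + 1 < n:
                yield cell, cell + 1      # neighbor to the right
            if r + 1 < n:
                yield cell, cell + n      # neighbor below


def interface_stats(n, part_of):
    """Return (cut faces, max neighboring partitions, max faces shared by one pair)."""
    shared = defaultdict(int)             # (p, q) with p < q -> faces on their interface
    for a, b in grid_faces(n):
        pa, pb = part_of(a), part_of(b)
        if pa != pb:
            shared[(min(pa, pb), max(pa, pb))] += 1
    neighbors = defaultdict(set)
    for p, q in shared:
        neighbors[p].add(q)
        neighbors[q].add(p)
    cut = sum(shared.values())
    return cut, max(len(s) for s in neighbors.values()), max(shared.values())


N = 16
strips = interface_stats(N, lambda cell: cell // N)                               # one row per part
blocks = interface_stats(N, lambda cell: (cell // N // 4) * 4 + (cell % N) // 4)  # 4 x 4 blocks
print("strips:", strips)
print("blocks:", blocks)
```

The blocks cut fewer faces in total (96 vs. 240) and exchange smaller messages (4 faces vs. 16) but communicate with up to four neighbors instead of two, which mirrors the behavior described for the METIS decomposition.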


* Metis is a Greek word meaning ‘wisdom.’
Performance with the
METIS Grid Partitioning Tool 


• Test case 

- LDI-MVS combustor (444K elements) 

- ILDM kinetics module 

- Platform: SGI Origin 2000 

• Performance 

- 96 processors 

- 0.69 secs/iteration 

• Estimated convergence in 10,000 iterations or 1.9 
hours. 

• Estimated convergence for a 1.3 million element 
combustor is 5.8 hours. 



Performance with the 
METIS Grid Partitioning Tool 

• Test case

- LDI-MVS combustor (971K elements)

- ILDM kinetics module 

- Platform: SGI Origin 2000 

• Performance 

- 96 processors 

- 1.37 secs/iteration

• Estimated convergence in 10,000 iterations or 3.8 
hours. 

• Estimated convergence for a 1.3 million element
combustor is 5.1 hours.
National Combustion Code (NCC)
Performance Timeline 


• The current goal is to achieve a three-hour turnaround of a full
combustor simulation (1.3 million elements) using CORSAIR-CCD
by September 2001. This will represent a 1000:1 reduction
in turnaround time relative to 1992.

• 1992: Estimated time to solution was 3,072 hours.

• 1995: Time to solution was 500 hours.

• 1999: Time to solution was 9 hours.

• 2000: Time to solution is 6 hours.

• Currently at a 512:1 reduction in turnaround time relative to 1992.
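The reduction factors follow directly from the times listed, taking the 1992 estimate as the reference. A minimal Python check:

```python
reference_1992 = 3072.0                          # hours, 1992 estimate
timeline = {1995: 500.0, 1999: 9.0, 2000: 6.0}   # hours to solution

for year, hours in timeline.items():
    print(f"{year}: {reference_1992 / hours:.0f}:1")

goal_ratio = reference_1992 / 3.0                # three-hour goal for September 2001
```

The three-hour goal corresponds to 1024:1, which the slide rounds to 1000:1.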

Future Work Planned


• Investigate combining message passing with shared-memory
programming so that additional processors can be used more
efficiently.

- Continue to use MPI for existing domain-level, coarse-grained 
parallelism. 

- Investigate using OpenMP for loop-level parallelism. 